Take-home project 1
Write your PSU email address here:
Share the notebook with aun1@psu.edu
Load the data
import pandas as pd

# Read the tab-separated variant calls into a data frame
variants = pd.read_csv(
    "https://raw.githubusercontent.com/nekrut/bda/main/data/pf_variants.tsv",
    sep="\t"
)
variants.head()
Instructions
Our goal is to understand whether the malaria parasite (Plasmodium falciparum) infecting these individuals is resistant to Pyrimethamine, an antimalarial drug. Resistance to Pyrimethamine is conferred by a mutation in the PF3D7_0417200 (dhfr) gene (Cowman et al., 1988). Given sequencing data from four individuals, we will determine which one of them is infected with Plasmodium falciparum carrying mutations in this gene.
The variant calls in the provided pandas data frame come from four samples, two from Ivory Coast and two from Colombia:
| Accession | Location |
|---|---|
| ERR636434 | Ivory Coast |
| ERR636028 | Ivory Coast |
| ERR042232 | Colombia |
| ERR042228 | Colombia |
These accessions correspond to datasets stored in the Sequence Read Archive at NCBI.
(Data from MalariaGEN.)
Specifics
- Filter the variants that fall within the dhfr gene.
- Restrict the variants to missense variants only, using the effect column.
- You are specifically interested in the variant at amino acid position 108.
- Create a graph that shows samples vs. variant coordinates, in which the size of each mark is proportional to the alternative allele frequency (AF column).
- Create a graph showing a world map in which the allele frequencies of the samples are represented as pie charts placed within the map of Colombia and within the map of Ivory Coast. To be more specific, for each location you have two samples. Each of these samples has an allele frequency at the resistance site. Use these allele frequencies as the slice areas of the pie chart.
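The filtering steps above can be sketched on a small made-up table. This is only an illustration, not the expected solution: apart from `effect` and `AF`, every column name here (`sample`, `gene`, `aa_pos`) is an assumption, as are the toy values, the `"missense"` substring match, and the size factor of 300. Check `variants.columns` and the actual contents of the effect column before adapting it.

```python
import pandas as pd

# Toy stand-in for `variants`. All values are invented for illustration.
# Only the `effect` and `AF` column names come from the assignment text;
# `sample`, `gene`, and `aa_pos` are hypothetical names.
toy = pd.DataFrame({
    "sample": ["S1", "S2", "S3", "S3", "S4"],
    "gene": ["PF3D7_0417200"] * 4 + ["OTHER_GENE"],
    "effect": ["missense_variant", "missense_variant", "missense_variant",
               "synonymous_variant", "missense_variant"],
    "aa_pos": [108, 108, 51, 60, 108],
    "AF": [0.9, 0.4, 0.7, 0.2, 0.5],
})


def dhfr_missense(df, gene_id="PF3D7_0417200"):
    """Keep rows that lie in the given gene and are annotated as missense."""
    in_gene = df["gene"] == gene_id
    # Substring match, because the exact effect label is an assumption.
    is_missense = df["effect"].str.contains("missense", case=False, na=False)
    return df[in_gene & is_missense]


hits = dhfr_missense(toy)

# Narrow down to the amino acid position of interest.
site108 = hits[hits["aa_pos"] == 108]

# Marker size proportional to the alternative allele frequency;
# the factor 300 is an arbitrary scaling choice.
site108 = site108.assign(marker_size=site108["AF"] * 300)

print(site108[["sample", "aa_pos", "AF", "marker_size"]])
```

A column such as `marker_size` could then be passed as the `s` argument of `matplotlib.pyplot.scatter`, with samples on one axis and variant coordinates on the other.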
You can use any AI you want (preferably the one integrated in Colab), but it will never give you exactly what you want, so you will have to adjust its output. You will have to explain to me what the steps were.